Intro to R for Medical Data

A companion document for the R/Medicine 2021 Virtual Conference’s Introduction to R for Medical Data workshop.

true , true
2021-08-14

Known issues

Welcome to R/Medicine 2021. In this workshop we will be doing a hands-on introduction to R for medical data.

Workshop basics

Why should I come to this workshop?

Set up for workshop

You’ll need the following during the workshop:

Please note while you can use a local version of R and Rstudio for the workshop, we cannot provide support for you during the workshop to address issues that come up your local device and software. We use RStudio Cloud because we can do all the technical preparation for participants ahead of time and spend our limited workshop time on learning R as fast as possible.

If you would like to some additional preparation for the workshop, I recommend learning some about Markdown in this ten minute tutorial.

Schedule

Introduction ~xxx minutes

Systems check ~xxx minutes

Learn and Do ~xxx minutes

Wrap Up and Discussion ~xxx minutes

Starting your journey in R

Orientation

RStudio Cloud

For this workshop we will be using RStudio Cloud.

Why? Because there is nothing that you need to download!

You don’t need to use RStudio to use R, but I do!

RStudio on your computer and RStudio Cloud look very similar.

R and RStudio are free to download to your computer.

Systems Check

Open up the project

Once you are in RStudio Cloud, click on the project called “Intro to R for Medical Data.”

RStudio Orientation: Basics

R Markdown Orientation

Let’s open the file in the file pane (number 4 in the image above) called Intro_to_R_for_medical_data_workshop.Rmd. This is what we will be working for the workshop.

Your document will look something like this.

YAML

The YAML header contains the special instructions on how to create the output document. We won’t do much with it here today, but it is a very powerful way to make your Rmd file as bespoke as you want it!

Code Chunks

Code chunks are where the code will go.

Code chunks have a gray background.

#This is a code chunk!
#Here is a simple calculation
1 + 2
[1] 3

You can run a code chunk by pressing the green play button.

Graphing Data

Data in a Excel spreadsheet?

No problem! R can ingest data from lots of locations including from Excel spreadsheets that you might already be very familiar with.

Load data into R

XXX smoke data from XXX with data dictionary here XXX put in link.

Run the code chunk below by presening that green play button.

Inspecting a data frame

One thing that Excel does well is to provide an interactive visual representation of the data. This allows you to inspect it by sorting and filtering. RStudio actually does this well, too, with one difference - it won’t let you change any of the data while you inspect it.

Look on the right at the Environment pane (you might have to click icon that looks like a spreadsheet on the “Environment” tab) and find the entry smoke_complete. This is the data frame you just created inside of R’s memory. (If you don’t see smoke_complete, try running the code chunk above again).

Within the Environment pane, click on the smoke_complete to view the data.

Go ahead and try to edit one of the values in this viewer. You will find that you can’t. It would have been easy for the RStudio programmers to allow editing of specific values, but they decided not to add that feature.

Why do you think this was designed that way?

Next we will use the glimpse() function to learn about our data.

How many rows of data do we have in our data?

How many columns are there? What are some of the column names?

Can you tell what kind of variable is stored in the columns?

Rows: 1,152
Columns: 20
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage i~
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 2337~
$ vital_status                <chr> "dead", "dead", "dead", "alive",~
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8~
$ days_to_death               <chr> "371", "136", "2304", "NA", "NA"~
$ state                       <chr> "live", "live", "live", "live", ~
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, ~
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_last_follow_up      <chr> "NA", "NA", "2099", "3747", "357~
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356~
$ years_smoked                <chr> "NA", "NA", "NA", "NA", "NA", "N~
$ gender                      <chr> "male", "male", "female", "male"~
$ year_of_birth               <chr> "1936", "1931", "1927", "1930", ~
$ race                        <chr> "white", "asian", "white", "whit~
$ ethnicity                   <chr> "not hispanic or latino", "not h~
$ year_of_death               <chr> "2004", "2003", "NA", "NA", "NA"~
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", ~
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", ~

Next, we will use the skim() function to learn even more about our data.

Press the green play button in the code chunk below.

Now can you tell the different column types we have? What types are in this data?

Look at the mini histograms at the bottom of the output? What can you learn very quickly about your data for the distribution cigarettes_per_year variable
Table 1: Data summary
Name smoke_complete
Number of rows 1152
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
primary_diagnosis 0 1 5 6 0 17 0
tumor_stage 0 1 7 12 0 11 0
vital_status 0 1 4 5 0 2 0
morphology 0 1 6 6 0 12 0
days_to_death 0 1 1 4 0 279 0
state 0 1 4 4 0 1 0
tissue_or_organ_of_origin 0 1 5 5 0 16 0
site_of_resection_or_biopsy 0 1 5 5 0 16 0
days_to_last_follow_up 0 1 1 4 0 459 0
years_smoked 0 1 1 2 0 48 0
gender 0 1 4 6 0 2 0
year_of_birth 0 1 2 4 0 67 0
race 0 1 5 41 0 6 0
ethnicity 0 1 12 22 0 3 0
year_of_death 0 1 2 4 0 21 0
bcr_patient_barcode 0 1 12 12 0 734 0
disease 0 1 4 4 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at_diagnosis 0 1 24175.38 3926.13 7855.00 22068.75 24750.50 26927.00 32872 <U+2581><U+2581><U+2583><U+2587><U+2582>
days_to_birth 0 1 -24175.38 3926.13 -32872.00 -26927.00 -24750.50 -22068.75 -7855 <U+2582><U+2587><U+2583><U+2581><U+2581>
cigarettes_per_day 0 1 2.61 2.04 0.01 1.37 2.19 3.29 40 <U+2587><U+2581><U+2581><U+2581><U+2581>

Plotting our data

Now we will take the data we have loaded and make some plots!

Produce a histogram of smoke_complete using geom_histogram(), mapping these variables to the following aesthetics:cigarettes_per_day to the x aesthetic.

plot_1 <- ggplot(data = smoke_complete,
                 aes(x =  cigarettes_per_day)) +
  geom_histogram()

plot_1

We’ve made a plot, but there are some issues with it:

Let’s make our plot a bit prettier.

You got a message after running that code that: “stat_bin() using bins = 30. Pick better value with binwidth.”

Let’s modify the width of our histogram’s bins in the code chunk below to 2. Then try 10.

Finally let’s explore the differences in daily cigarettes for the gender variable.

Below add gender to the facet

ggplot2: A Grammar of Graphics

ggplot2 is an extremely powerful software library for visualization.

The gg is short for Grammar of Graphics, which means that visualizations are expressed in a specific and consistent way for different types of visualizations in the package. There is a wonderful universe of complementary packages in R that extend ggplot to many different types of visualizations.

Here’s a visual summary of the different parts we’re talking about today. There are many parts to visualizations, but many of us don’t have the words to describe the different types of parts of a graph. Here we will spend some time breaking down the meeting of different constituent parts.

Note: there are different ways to write ggplot code to get the same output, sort of formal versus causal conversation. Here we will focus on the more formal style because it is the most explicit, but also more verbose! As you get familiar with ggplot, you will likely shorten your code a bit at times from this very explicit method. We won’t cover every layer in depth, but know that breaking down visualizations to these constituents allows for tremendous control.

Image from Thomas Lin Pedersen

Learning to read ggplot2 code

A ggplot2 graphic consists of a mapping of variables in data to the aes()thetic attributes of geom_etric objects.

In code, this is translated as:

    # start the plot object with ggplot()
    # add assign smoke_complete to the data argument

    ggplot(data = smoke_complete,   

    # map the variables to visual properties of the graph
           mapping = aes(
            
    # map the x-axis to age_at_diagnosis
              x = age_at_diagnosis, 
              
    # map the y-axis to cigarettes_per_day       
              y = cigarettes_per_day,
    
    # map the color to the disease variable
              color = disease
              )) +

    # add the geometry and the alpha
          geom_point(alpha = 0.2) + # complete the geom to geom_point
  
    # add labels to your plot

          labs(title = "Age of diagnosis of cancer by daily cigarette consumption", # plot title
               x="Age in days", # x-axis label
               y="Cigarettes per day", # y-axis label
               color = "Disease type") + # label for color legend
  
    # add a facet to your plot
  
          facet_grid(cols = vars(gender)) + # facet by gender variable
  
    # add a theme to your plot
  
          theme_bw()

Things to note: we chain these things together with + (plus sign).

Rebuild the prior plot

Set the data argument to smoke_complete, then run the code chunk.

Add mapping arguments to the aes function. Assign the variable age_at_diagnosis to x, cigarettes_per_day to y, and disease to color.

Run the code chunk. How does it look different from the prior code chunk?

Add geom_point().

Ok, wow, looking great!! Any major differences you notice between the graph provided before the step by step process in code chunk called full_example above?

In the next modification, modify the alpha argument to vary the opacity of the points. You can vary the value from 0 to 1. Try a few different values to see the difference.

Wait! What little trick was that? See the variable plot_5 in the code chunk above? That is storing your whole code with your nice new labels. Instead of continuing to rewrite your code, you can just modify as needed. Below we are going to facet by gender to separate out male and female to inspect the differences.

Finally we will play a bit with different themes. Pick two of the below themes and then add them to the plot_5 and run the code chunk to compare them.

Theme options:

There are plenty of out of the box themes in ggplot2, but you can make a theme, use other people’s, or even use a theme your organization might already have. Let’s load the `ggthemes’ library.

Time for a break!!

Part 2: Wraggling Data

Rarely is your data going to be in the form you need it to be analyzed and plotted. You will often need to wrangle your data and change the shape of it a bit.

Let’s discuss a bit different ways we might need to reshape data.

Discussion

There are different packages and ways people wrangle data with R, but we’re going to demonstrate using some packages from the tidyverse, which is a whole ecosystem of R packages organized around having tidy data.

select()

From our smoke_complete data, let’s select two columns to keep: gender and days_to_death.

# A tibble: 1,152 x 2
   gender days_to_death
   <chr>  <chr>        
 1 male   371          
 2 male   136          
 3 female 2304         
 4 male   NA           
 5 female NA           
 6 male   345          
 7 male   716          
 8 male   2803         
 9 male   973          
10 male   1097         
# ... with 1,142 more rows

There is a lovely package called magrittr that was loaded earlier, but includes something called a pipe, that looks like this in code: %>%. It allows use to call a nicely pipe data and preform lots of tasks on it.

Run the code below and inspect the output. How does it compare to the output from the prior code chunk?

# A tibble: 1,152 x 2
   gender days_to_death
   <chr>  <chr>        
 1 male   371          
 2 male   136          
 3 female 2304         
 4 male   NA           
 5 female NA           
 6 male   345          
 7 male   716          
 8 male   2803         
 9 male   973          
10 male   1097         
# ... with 1,142 more rows

filter()

Now let’s meet filter().

# A tibble: 2 x 20
  primary_diagnosis tumor_stage age_at_diagnosis vital_status
  <chr>             <chr>                  <dbl> <chr>       
1 C34.3             stage ib               19025 dead        
2 C34.3             stage ib               19025 dead        
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
#   state <chr>, tissue_or_organ_of_origin <chr>,
#   days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
#   days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
#   years_smoked <chr>, gender <chr>, year_of_birth <chr>,
#   race <chr>, ethnicity <chr>, year_of_death <chr>,
#   bcr_patient_barcode <chr>, disease <chr>

The pipe alternative way.

# A tibble: 2 x 20
  primary_diagnosis tumor_stage age_at_diagnosis vital_status
  <chr>             <chr>                  <dbl> <chr>       
1 C34.3             stage ib               19025 dead        
2 C34.3             stage ib               19025 dead        
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
#   state <chr>, tissue_or_organ_of_origin <chr>,
#   days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
#   days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
#   years_smoked <chr>, gender <chr>, year_of_birth <chr>,
#   race <chr>, ethnicity <chr>, year_of_death <chr>,
#   bcr_patient_barcode <chr>, disease <chr>

Together select() and filter() will be big workhorses in your data wraggling toolkit. However, at first it is easy to confuse which function does what.

select() selects the columns to stay.

filter() filters the rows to keep that meet certain conditions placed on columns. So in the example above, we filtered the data to only keep the rows where the bcr_patient_barcode was equal to “TCGA-18-3412.”

%>%

We meet the pipe %>% earlier, but let’s see how we can combine multiple pipes in a row to get a more complicated operation done.

# A tibble: 1,032 x 2
   age_at_diagnosis gender
              <dbl> <chr> 
 1            24477 male  
 2            26615 male  
 3            28171 female
 4            27154 male  
 5            23370 female
 6            19025 male  
 7            26938 male  
 8            28430 male  
 9            30435 male  
10            24019 male  
# ... with 1,022 more rows

mutate()

Let’s make age_at_diagnosis more human friendly by creating a new column that divides the days by 365.25. (There are great R packages to handle data, times, duration, intervals, and all those other message time and date issues!)

# A tibble: 1,152 x 2
   age_at_diagnosis age_at_diagnosis_years
              <dbl>                  <dbl>
 1            24477                   67.0
 2            26615                   72.9
 3            28171                   77.1
 4            27154                   74.3
 5            23370                   64.0
 6            19025                   52.1
 7            26938                   73.8
 8            28430                   77.8
 9            30435                   83.3
10            24019                   65.8
# ... with 1,142 more rows

A moment for beautiful tables

There are several great packages that can make lovely tables that are publication ready.

One quick example here of our prior code chunk output table_1. We take that same object and let a package called kableExtra work some magic on it!

age_at_diagnosis age_at_diagnosis_years
24477 67.01437
26615 72.86790
28171 77.12799
27154 74.34360
23370 63.98357
19025 52.08761
26938 73.75222
28430 77.83710
30435 83.32649
24019 65.76044

Quick preview rmarkdown

R Markdown documents, like this document, allow you to place text and analysis with the code all in a single document and output the result into different formats such as an html webpage, a pdf, or even a word document.

Let’s quickly knit a document to seee the output. In pane 4, or the file pane, find the document called knit_preview.Rmd and open it. We will see examples of knitting the same .Rmd file to a word document, a pdf, and to html.

Wrap up

How to search for help

Within R you can type ? followed by the name of the function eg ?filter() in the console.

In RStudio, you can look at the help tab in pane 4.

However, in the end your best resource is probably going to be searching online for the issue. From online forms, to blog posts, to twitter threads there is a ton of content out there, but crafting a good search query is a workshop in itself!

Further Reading and Resources

ggplot2: Elegant Graphics for Data Analysis

R for Data Science

R Graphics Cookbook

The list is always growing! The trouble isn’t finding good resources, it’s finding time to read them!

Acknowledgments

This work was made possible by the distill, ggplot, dplyr, and rmarkdown packages.

Author Contributions


  1. https://support.rstudio.com/hc/en-us/articles/227449447-Supported-browsers-for-RStudio-Connect↩︎

References